```r
# This is a so-called "R chunk" where you can write R code.
date()
```

```
## [1] "Mon Dec 13 09:55:01 2021"
```
As a former computer science major, everything so far has seemed pretty easy. I have had some problems with DataCamp, which doesn't always accept correct R scripts. I expect to learn how to use R for the statistical work in my thesis. Previously I have used SPSS, but it always seemed to be lacking on the programming side. At least for now, R feels more familiar as a language due to, again, my computer science background.
My Diary: https://mvannas.github.io/IODS-project/
My Repository: https://github.com/mvannas/IODS-project
FYI: SSH authentication is now required by GitHub. While installing the public key in GitHub was simple, I used the instructions at https://cyberhelp.sesync.org/faq/set-up-gitlab-ssh-key.html to configure RStudio to use the SSH key. Also, the first time you push a commit to GitHub and the push asks something, you need to answer 'yes' to install the host key.
```r
test_data <- read.csv("data/work_data.csv")
str(test_data)
```

```
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
```
The dataset was originally collected between 3.12.2014 and 10.1.2015. It is a combined survey with questions from ASSIST (Approaches and Study Skills Inventory for Students) and SATS (Survey of Attitudes Toward Statistics). The subset used in this exercise contains the participants' background (gender, as a binary variable, and age) and combined mean scores for the deep, strategic and surface approaches to learning. Our subset also contains the participants' total exam points and their attitude towards statistics (from SATS). After removing cases with zero points, 166 cases with 7 variables each remained.
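The data wrangling itself was done in a separate script, but the zero-points filter described above can be sketched with dplyr (assuming the full survey data is in a data frame called `raw_data`, a hypothetical name):

```r
library(dplyr)

# Keep only participants who actually attended the exam
# (zero exam points indicates non-attendance)
test_data <- raw_data %>% filter(points > 0)
```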
```r
library(GGally)
```

```
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
```

```r
library(ggplot2)
library(dplyr)
```

```
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```
```r
test_data %>% group_by(gender) %>% summarise_at(vars(age, points, deep, stra, surf, attitude), list(name = mean))
```

```
## # A tibble: 2 × 7
## gender age_name points_name deep_name stra_name surf_name attitude_name
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 24.9 22.3 3.66 3.20 2.83 29.9
## 2 M 26.8 23.5 3.72 2.96 2.70 34.4
```
```r
p <- ggpairs(test_data, mapping = aes(col = gender), upper = list(continuous = wrap('cor', size = 3)), lower = list(combo = wrap("facethist", bins = 20)))
p
```
Here you can see the summary statistics and the relationships between variables. There are more female than male participants, but there are no immediately observable differences between male and female scores or points. The average age of the participants is 24.9 (females) and 26.8 (males), with some clearly older outliers in both genders.
The deep and surface approaches to learning seem to correlate negatively with points, while the strategic approach correlates positively. Attitude also has a strong positive correlation with the overall points score.
```r
model <- lm(points ~ attitude + stra + surf, data = test_data)
summary(model)
```

```
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
```
Based on the previous correlation estimates, I selected attitude, the strategic approach score and the surface approach score for a multiple regression analysis. The summary shows that attitude is statistically the most significant explanatory variable, yet its estimated coefficient is the smallest (0.33952). It should be noted that this score has not been averaged to the 0-5 scale like the deep, stra and surf scores. Two options exist: you can either scale the explanatory variables to the same range, or you can calculate standardized coefficient estimates, for which there are ready-made R libraries.
```r
library(QuantPsyc)
```

```
## Loading required package: boot
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'QuantPsyc'
## The following object is masked from 'package:base':
##
## norm
```

```r
lm.beta(model)
```

```
## attitude stra surf
## 0.42039820 0.11170276 -0.05257769
```
The surf score has the smallest standardized coefficient estimate, so I remove it from the equation and recalculate.
```r
model <- lm(points ~ attitude + stra, data = test_data)
summary(model)
```

```
##
## Call:
## lm(formula = points ~ attitude + stra, data = test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
```
After reducing the explanatory variables to just attitude and stra, the stra score still does not appear statistically significant (using p < 0.05 as the threshold). Removing the surf score did, however, increase the adjusted R-squared, which means this model fits the observed data better.
```r
model <- lm(points ~ attitude, data = test_data)
summary(model)
```

```
##
## Call:
## lm(formula = points ~ attitude, data = test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
```
Running the model with just attitude results in a lower R-squared. Thus I would conclude that the previous model, with stra and attitude as the explanatory variables, is a better fit.
The multiple R-squared describes how much of the variation in the target the explanatory variables jointly explain. Higher is better, up to 1: a value of 0 means the explanatory variables explain none of the variation in the observed values, and 1 means they explain all of it. The adjusted R-squared takes into account how much an added explanatory variable would be expected to increase the value by chance, so it grows only if the added variable contributes something useful to the model beyond pure chance.
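To make the definitions concrete, both values can be recomputed by hand from the fitted `model` object; this is a sketch of the standard formulas, not part of the original analysis:

```r
# Multiple R-squared: 1 - RSS/TSS
rss <- sum(residuals(model)^2)                             # residual sum of squares
tss <- sum((test_data$points - mean(test_data$points))^2)  # total sum of squares
r2  <- 1 - rss / tss

# Adjusted R-squared penalizes each added explanatory variable
n <- nrow(test_data)          # number of observations
k <- length(coef(model)) - 1  # number of explanatory variables
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
```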
Our linear model makes some assumptions that the data has to fulfill. Firstly, the model is linear, so the relationship between the explanatory variables and the target must be linear. It is also usually assumed that the errors are normally distributed. The validity of these assumptions can be explored by looking at the residual plots.
```r
model <- lm(points ~ attitude + stra, data = test_data)
par(mfrow = c(2,2))
plot(model, which = c(1,2,5))
```
The Residuals vs Fitted plot is used to check the model for constant error variance. In the top left panel we see our residuals plotted against the fitted values. The points seem evenly scattered without any obvious patterns, so we conclude that our model, with this data, has a fairly constant error variance.
The normal Q-Q plot is used to check that our errors are normally distributed. The plot should show fairly little deviation from the line to pass this check. Again, with our model and data the errors seem to be normally distributed.
The Residuals vs Leverage plot shows the residuals and Cook's distance. Cook's distance is an estimate of the influence of an observation on the parameters of the model; values over 1 are usually considered suggestive of undue influence. Our plot shows that while a few observations have slightly more influence than the rest, their Cook's distances are still very low, so no observation has undue power over the model.
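The Cook's distances can also be inspected numerically instead of reading them off the plot; a small sketch using the fitted `model`:

```r
cd <- cooks.distance(model)
max(cd)        # the largest influence in the data
which(cd > 1)  # observations with undue influence; none expected here
```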
```r
d <- read.csv("data/data_excrs2.csv")
colnames(d)
```

```
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "alc_high"
```
This dataset contains information on student grades and on demographic, social and school-related features. It joins information collected from students on two courses, math and Portuguese language. Grades, absences, past course failures and extra paid classes were combined by using either the mean of the two values or, for binary (yes/no) variables, the math course value. All other values were assumed to be the same.
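The combining itself was done in a separate data wrangling script. As a sketch of the idea (the data frame and column names here are hypothetical), with dplyr it could look like this:

```r
library(dplyr)

# Join the math and Portuguese datasets on the shared background variables
joined <- inner_join(math, por, by = join_cols, suffix = c(".math", ".por"))

joined <- joined %>%
  mutate(absences = (absences.math + absences.por) / 2,  # mean of the two values
         paid     = paid.math)                           # math value for yes/no variables
```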
This data was obtained from https://archive.ics.uci.edu/ml/datasets/Student+Performance#. More detailed information can be found there.
The target value of this exercise is high/low alcohol consumption. The explanatory variables I selected are the student's age, sex, the parents' cohabitation status (together or apart) and romantic relationship status (yes/no). My hypothesis is that single young males from families with parents living apart are more likely to consume larger amounts of alcohol.
```r
library("GGally")
library("ggplot2")
library("dplyr")
d_sub <- dplyr::select(d, sex, age, romantic, Pstatus, alc_high)
# Convert romantic and Pstatus to boolean, Pstatus TRUE if together
d_sub <- d_sub %>% mutate(romantic=(romantic == "yes"), Pstatus=(Pstatus=="T"))
p <- ggpairs(d_sub, mapping = aes(col = sex), upper = list(continuous = wrap('cor', size = 3)), lower = list(combo = wrap("facethist", bins = 20)))
p
summary(d_sub)
```

```
## sex age romantic Pstatus
## Length:370 Min. :15.00 Mode :logical Mode :logical
## Class :character 1st Qu.:16.00 FALSE:251 FALSE:38
## Mode :character Median :17.00 TRUE :119 TRUE :332
## Mean :16.58
## 3rd Qu.:17.00
## Max. :22.00
## alc_high
## Mode :logical
## FALSE:259
## TRUE :111
##
##
##
```

```r
d_sub %>% group_by(sex) %>% summarize(n = n(), mean(age))
```

```
## # A tibble: 2 × 3
## sex n `mean(age)`
## <chr> <int> <dbl>
## 1 F 195 16.6
## 2 M 175 16.6
```
From the summary diagram above we can see the distributions of the chosen variables. The male to female ratio is fairly even, and the students are mostly young. Very few students have parents living apart, and there are almost twice as many singles as students in a relationship. The proportions in the boxplots for alcohol usage initially look similar to the variable distributions in the data, so there is no clear evidence to support my hypothesis.
```r
m <- glm(alc_high ~ sex + age + romantic + Pstatus, data = d_sub, family = "binomial")
summary(m)
```

```
##
## Call:
## glm(formula = alc_high ~ sex + age + romantic + Pstatus, family = "binomial",
## data = d_sub)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4115 -0.8927 -0.6684 1.2258 1.9839
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.71157 1.67409 -2.814 0.00489 **
## sexM 0.92425 0.23618 3.913 9.1e-05 ***
## age 0.21613 0.09929 2.177 0.02950 *
## romanticTRUE -0.17882 0.25652 -0.697 0.48575
## PstatusTRUE -0.16909 0.38337 -0.441 0.65916
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 431.07 on 365 degrees of freedom
## AIC: 441.07
##
## Number of Fisher Scoring iterations: 4
```
```r
odd_ratios <- coef(m) %>% exp()
odds_ci <- confint(m) %>% exp()
```

```
## Waiting for profiling to be done...
```

```r
cbind(odd_ratios, odds_ci)
```

```
## odd_ratios 2.5 % 97.5 %
## (Intercept) 0.008990637 0.0003198605 0.230918
## sexM 2.519974630 1.5933393547 4.028052
## age 1.241266650 1.0232353008 1.511969
## romanticTRUE 0.836260504 0.5017997940 1.374988
## PstatusTRUE 0.844429463 0.4050871115 1.842945
```
As we see here, males seem to have a higher tendency for high alcohol usage, and, slightly surprisingly, older age in our student group also made high alcohol consumption more likely. Romantic status and parental cohabitation status did not make a sufficient difference: while both have odds ratios under 1, in the direction of my hypothesis, the effect is not significant, as the confidence interval includes 1 in both cases. Thus my hypothesis is not supported by our model of the data.
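As a reminder of where the odds ratios come from: a logistic regression coefficient is a change in log-odds, so exponentiating it gives the odds ratio. For example, for the sexM coefficient from the summary above:

```r
exp(0.92425)   # about 2.52, matching the odds ratio table
```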
```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
p <- predict(m, type = "response")
d_sub <- mutate(d_sub, alc_high_probability = p)
d_sub <- mutate(d_sub, alc_high_predicted = alc_high_probability > 0.5)
table(high_use = d_sub$alc_high, prediction = d_sub$alc_high_predicted)
```

```
## prediction
## high_use FALSE TRUE
## FALSE 255 4
## TRUE 107 4
```

```r
loss_func(class = d_sub$alc_high, prob = d_sub$alc_high_probability)
```

```
## [1] 0.3
```

```r
loss_func(class = d_sub$alc_high, prob = 1)
```

```
## [1] 0.7
```

```r
library("boot")
cv <- cv.glm(data = d_sub, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
```

```
## [1] 0.3081081
```
My model misclassifies about 30% of the students. Looking at the 2x2 table, it seems that my model very seldom predicts high alcohol usage. Simply assuming that no student has high alcohol consumption would result in the same proportion of misclassified students.
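The 30% training error can also be read directly off the 2x2 table above: the off-diagonal cells are the wrong predictions.

```r
# misclassified / total observations
(4 + 107) / (255 + 4 + 107 + 4)   # 111/370 = 0.3
```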
Our next exercise uses a ready-made dataset, the Boston dataset from the MASS package. It contains information on the city of Boston, with descriptive variables for the observed suburbs.
```r
library(MASS)
library("ggplot2")
library("GGally")
library("corrplot")
```

```
## corrplot 0.90 loaded
```

```r
library("tidyverse")
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.5 ✓ purrr 0.3.4
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x MASS::select() masks dplyr::select()
```

```r
library("vtable")
```

```
## Loading required package: kableExtra
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
```

```r
data("Boston")
str(Boston)
```

```
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
```

```r
dim(Boston)
```

```
## [1] 506 14
```
There are 506 observations with 14 different variables.
| Variable | Explanation |
|---|---|
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted mean of distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000s |
Information on variables from https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
```r
ggpairs(Boston, upper = list(continuous = wrap('cor', size = 2)))
ggcorr(cor(Boston), geom = "circle", nbreaks = 11)
sumtable(Boston)
```
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| crim | 506 | 3.614 | 8.602 | 0.006 | 0.082 | 3.677 | 88.976 |
| zn | 506 | 11.364 | 23.322 | 0 | 0 | 12.5 | 100 |
| indus | 506 | 11.137 | 6.86 | 0.46 | 5.19 | 18.1 | 27.74 |
| chas | 506 | 0.069 | 0.254 | 0 | 0 | 0 | 1 |
| nox | 506 | 0.555 | 0.116 | 0.385 | 0.449 | 0.624 | 0.871 |
| rm | 506 | 6.285 | 0.703 | 3.561 | 5.886 | 6.624 | 8.78 |
| age | 506 | 68.575 | 28.149 | 2.9 | 45.025 | 94.075 | 100 |
| dis | 506 | 3.795 | 2.106 | 1.13 | 2.1 | 5.188 | 12.126 |
| rad | 506 | 9.549 | 8.707 | 1 | 4 | 24 | 24 |
| tax | 506 | 408.237 | 168.537 | 187 | 279 | 666 | 711 |
| ptratio | 506 | 18.456 | 2.165 | 12.6 | 17.4 | 20.2 | 22 |
| black | 506 | 356.674 | 91.295 | 0.32 | 375.378 | 396.225 | 396.9 |
| lstat | 506 | 12.653 | 7.141 | 1.73 | 6.95 | 16.955 | 37.97 |
| medv | 506 | 22.533 | 9.197 | 5 | 17.025 | 25 | 50 |
From the summary statistics we can observe that the variables have widely different ranges of values: chas is a binary category with values 0 or 1, while black ranges from 0.32 up to about 397. With the exception of chas, most of the variables seem to have fairly strong correlations with at least some of the other variables.
```r
bsca <- scale(Boston)
bsca <- as.data.frame(bsca)
sumtable(bsca, add.median = TRUE)
```
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 50 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|---|
| crim | 506 | 0 | 1 | -0.419 | -0.411 | -0.39 | 0.007 | 9.924 |
| zn | 506 | 0 | 1 | -0.487 | -0.487 | -0.487 | 0.049 | 3.8 |
| indus | 506 | 0 | 1 | -1.556 | -0.867 | -0.211 | 1.015 | 2.42 |
| chas | 506 | 0 | 1 | -0.272 | -0.272 | -0.272 | -0.272 | 3.665 |
| nox | 506 | 0 | 1 | -1.464 | -0.912 | -0.144 | 0.598 | 2.73 |
| rm | 506 | 0 | 1 | -3.876 | -0.568 | -0.108 | 0.482 | 3.552 |
| age | 506 | 0 | 1 | -2.333 | -0.837 | 0.317 | 0.906 | 1.116 |
| dis | 506 | 0 | 1 | -1.266 | -0.805 | -0.279 | 0.662 | 3.957 |
| rad | 506 | 0 | 1 | -0.982 | -0.637 | -0.522 | 1.66 | 1.66 |
| tax | 506 | 0 | 1 | -1.313 | -0.767 | -0.464 | 1.529 | 1.796 |
| ptratio | 506 | 0 | 1 | -2.705 | -0.488 | 0.275 | 0.806 | 1.637 |
| black | 506 | 0 | 1 | -3.903 | 0.205 | 0.381 | 0.433 | 0.441 |
| lstat | 506 | 0 | 1 | -1.53 | -0.799 | -0.181 | 0.602 | 3.545 |
| medv | 506 | 0 | 1 | -1.906 | -0.599 | -0.145 | 0.268 | 2.987 |
After using the R scale function we have now standardized the values. The standardized summary table looks fairly even, however the crim variable seems to have a fairly large spread, as its maximum standardized value is about three times larger than in any other category.
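For a single column, scale() is equivalent to subtracting the column mean and dividing by the column standard deviation; a quick check:

```r
crim_std <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(crim_std, as.vector(scale(Boston$crim)))   # TRUE
```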
```r
# Create quartiles of the crim value as separate categories
crime <- cut(bsca$crim, breaks = quantile(bsca$crim), label =c("low","med_low", "med_high", "high"), include.lowest = TRUE)
bsca <- dplyr::select(bsca, -crim)
bsca <- data.frame(bsca, crime)
# Create test and training sets (80 % to the train set)
ind <- sample(nrow(bsca), size = nrow(bsca) * 0.8)
train <- bsca[ind,]
test <- bsca[-ind,]
# Copy the correct crime categories from the test set and remove them from the set
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
# Shamelessly copied from the DataCamp exercise
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
```
```
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2376238 0.2623762 0.2425743
##
## Group means:
## zn indus chas nox rm age
## low 0.90090711 -0.9142204 -0.083045403 -0.8768021 0.40851419 -0.8513426
## med_low -0.08593699 -0.3062445 0.014751158 -0.5596622 -0.09990773 -0.2988540
## med_high -0.37236182 0.1748789 0.247665303 0.4275266 0.10915732 0.4092328
## high -0.48724019 1.0171960 0.008892378 1.0526481 -0.51206756 0.8086961
## dis rad tax ptratio black lstat
## low 0.8929290 -0.6859201 -0.7272231 -0.4124970 0.37503486 -0.75601314
## med_low 0.3249621 -0.5464108 -0.4873288 -0.1079282 0.31819207 -0.10354447
## med_high -0.3738188 -0.4152219 -0.3169415 -0.3389626 0.07030946 0.01218742
## high -0.8431601 1.6373367 1.5134896 0.7798552 -0.85019704 0.87037566
## medv
## low 0.47243983
## med_low 0.02791933
## med_high 0.20291716
## high -0.64019155
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.113503212 0.614241230 -0.95850535
## indus 0.009915225 -0.295065953 0.34711440
## chas -0.073279932 -0.028668696 0.08067394
## nox 0.348699532 -0.781682487 -1.40499020
## rm -0.116045005 -0.122275368 -0.12240229
## age 0.346232263 -0.257550970 -0.17307044
## dis -0.059974085 -0.191977831 0.03894738
## rad 3.186612801 0.883969835 -0.05898394
## tax 0.011301473 0.103864048 0.49423463
## ptratio 0.118258785 0.003979946 -0.35219502
## black -0.135422850 0.029387561 0.16490111
## lstat 0.211668705 -0.254536642 0.51578878
## medv 0.220606127 -0.392943826 -0.10372815
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9487 0.0395 0.0118
```
```r
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
```
From the biplot, and quite easily also from the coefficients of the linear discriminants in the output above, we can see that the rad variable, i.e. the index of accessibility to radial highways, is strongly associated with the high crime rate category.
For the three lower categories of crime, it seems that areas with larger plot sizes have lower crime rates. The nitrogen oxide concentration seems to have the opposite effect (higher concentration, more crime).
```r
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
```

```
## predicted
## correct low med_low med_high high
## low 16 5 2 0
## med_low 5 21 4 0
## med_high 0 6 13 1
## high 0 0 0 29
```
The generated model seems to predict the higher crime rate areas best. The high category forms a clearly separate cluster in the biplot, which most likely explains this; for the three other categories there is more variance in the predictions.
```r
# Need to redo the scaling for K-means to work
bsca <- scale(Boston)
dist_euclidian <- dist(bsca)
summary(dist_euclidian)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
```

```r
kc <- 6
km <- kmeans(bsca, centers = 4)
# compute TWCSS on the scaled data (the original used the unscaled Boston)
TWCSS <- sapply(1:kc, function(k){kmeans(bsca, k)$tot.withinss})
bsca <- as.data.frame(bsca)
pairs(bsca[1:5], col = km$cluster)
pairs(bsca[6:10], col = km$cluster)
pairs(bsca[11:14], col = km$cluster)
qplot(x = 1:kc, y = TWCSS, geom = 'line')
```
Looking at the paired plots, or the more informative total within-cluster sum of squares (TWCSS) plot, we can see that clustering into two groups would already capture most of the differences in the variables.
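Note that kmeans() uses random starting centers, so the clustering and the elbow plot can vary between runs; setting a seed and using several restarts makes them reproducible. A sketch on the scaled data:

```r
set.seed(123)   # reproducible starting centers
twcss <- sapply(1:10, function(k)
  kmeans(bsca, centers = k, nstart = 10)$tot.withinss)
qplot(x = 1:10, y = twcss, geom = "line")
```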
The following exercise is for dimensionality reduction techniques.
```r
library("ggplot2")
library("GGally")
library("corrplot")
library("tidyverse")
library("vtable")
library("FactoMineR")
```

```r
human <- read.csv("data/human.cvs")
rownames(human) <- human$X
human <- dplyr::select(human, -X)
```
This exercise uses the United Nations Development Programme data, which is freely available. The data, with explanations, can be found at the URL below.
http://hdr.undp.org/en/content/human-development-index-hdi
We used only some of the available variables: gross national income per capita (gnicap), the secondary education proportion male/female (popseceduP), the labour force proportion male/female (labrateP), life expectancy at birth (lifebirth), the maternal mortality ratio (mmr), the adolescent birth rate (abirth) and the percentage of female representatives in parliament (prp).
```r
# Summary statistics
st(human)
```
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| popseceduP | 155 | 0.853 | 0.242 | 0.172 | 0.726 | 0.997 | 1.497 |
| labrateP | 155 | 0.707 | 0.199 | 0.186 | 0.598 | 0.854 | 1.038 |
| lifebirth | 155 | 71.654 | 8.332 | 49 | 66.3 | 77.25 | 83.5 |
| mmr | 155 | 149.084 | 211.79 | 1 | 11.5 | 190 | 1100 |
| abirth | 155 | 47.159 | 41.112 | 0.6 | 12.65 | 71.95 | 204.8 |
| prp | 155 | 20.912 | 11.488 | 0 | 12.4 | 27.95 | 57.5 |
| gnicap | 155 | 17627.903 | 18543.852 | 581 | 4197.5 | 24512 | 123124 |
```r
ggpairs(human)
```
Of the variables, only popseceduP looks like it might be normally distributed. The correlations are somewhat as expected: higher-GNI countries have lower maternal mortality and a lower adolescent birth rate, but higher life expectancy at birth and a higher proportion of females with secondary education. Understandably, there is also a strong negative correlation between life expectancy and the maternal mortality ratio. The adolescent birth rate correlates negatively with life expectancy at birth and positively with the maternal mortality ratio. The secondary education proportion also correlates fairly strongly with life expectancy, the maternal mortality ratio and the adolescent birth rate.
```r
pca_human <- prcomp(human)
pca_human
```

```
## Standard deviations (1, .., p=7):
## [1] 1.854416e+04 1.855166e+02 2.518445e+01 1.144825e+01 3.734146e+00
## [6] 2.004983e-01 1.594657e-01
##
## Rotation (n x k) = (7 x 7):
## PC1 PC2 PC3 PC4
## popseceduP -5.607472e-06 0.0006714036 -3.514696e-05 -0.0002620825
## labrateP 2.331945e-07 -0.0002819674 5.278772e-04 -0.0046743437
## lifebirth -2.815823e-04 0.0283153792 1.290783e-02 -0.0671093745
## mmr 5.655734e-03 -0.9916605713 1.259324e-01 -0.0059171414
## abirth 1.233961e-03 -0.1255518557 -9.919261e-01 0.0059333537
## prp -5.526459e-05 0.0032309234 -7.516620e-03 -0.9976994508
## gnicap -9.999832e-01 -0.0057717465 -5.149855e-04 0.0000478912
## PC5 PC6 PC7
## popseceduP -0.0028489667 6.639035e-01 7.478125e-01
## labrateP 0.0014434209 7.478087e-01 -6.638960e-01
## lifebirth 0.9972547027 5.520834e-04 3.260802e-03
## mmr 0.0261271494 1.856764e-04 8.289187e-04
## abirth 0.0168039116 4.333395e-04 -2.525059e-04
## prp -0.0671404994 -3.713633e-03 2.688240e-03
## gnicap -0.0001085808 -1.913832e-06 -1.038360e-06
```
```r
# note: the argument is 'choices'; a misspelled argument name is silently
# passed on to the plotting functions, producing a flood of warnings
biplot(pca_human, choices = 1:2, cex = c(0.5, 0.8))
```

```
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
```
Without standardization of the variables, the analysis naturally favors gnicap, as its values are orders of magnitude larger than those of any other variable. In fact, the biplot above is mostly useless: principal component 1 is dominated by the gnicap value, as is also seen in the PCA table.
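This domination is easy to verify from the variance proportions of the unstandardized PCA:

```r
# PC1's proportion of variance is essentially 1 when gnicap is not scaled
summary(pca_human)$importance[, 1:3]
```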
```r
human_std <- scale(human)
human_std_pca <- prcomp(human_std)
human_std_pca
```

```
## Standard deviations (1, .., p=7):
## [1] 1.8853909 1.1277699 0.8729226 0.7773374 0.6563069 0.5217013 0.3229207
##
## Rotation (n x k) = (7 x 7):
## PC1 PC2 PC3 PC4 PC5
## popseceduP -0.39680835 0.08249781 -0.27749277 0.594867895 -0.5800454
## labrateP 0.08795767 0.71602210 -0.60013568 0.043499672 0.2974827
## lifebirth -0.48694386 0.01769349 0.09640197 -0.066425633 0.1837366
## mmr 0.48669893 0.09908190 -0.10775543 -0.244158612 -0.2012745
## abirth 0.45560671 0.03545242 0.03478382 0.062499058 -0.5411089
## prp -0.07908554 0.67973897 0.70365258 0.009966551 -0.1363359
## gnicap -0.38338546 0.08410849 -0.21352633 -0.759084126 -0.4351948
## PC6 PC7
## popseceduP 0.20526018 0.16211971
## labrateP -0.14842875 -0.08362821
## lifebirth -0.58711008 0.60861280
## mmr 0.31130461 0.73801103
## abirth -0.68893873 -0.13643399
## prp 0.13126423 -0.02561923
## gnicap 0.04859511 -0.17995312
```
```r
# again with the correctly spelled 'choices' argument
biplot(human_std_pca, choices = 1:2, cex = c(0.5, 0.8), col = c("grey40", "deeppink2"))
```
Now we can much better see the results, and also now the results are more meaningful as the higher value of gni has replaced by standardized value. This allow us to more easily see how the principal component factors factor in. We can also see which variables have similar effect, aka. have correlation.
Clearly the variables aligned with the first principal component (the x-axis) mostly relate to health issues (maternal mortality, adolescent birth rate, etc.) and GNI, while the second component (the y-axis) reflects the female labour-force proportion and parliamentary representation.
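If numeric summaries are preferred over the biplot, the share of variance captured by each component is easy to compute from a `prcomp` object. Below is a minimal sketch on standardized toy data, used here as a stand-in for `human_std_pca` (which is assumed to come from `prcomp`):

```r
# A minimal sketch, assuming the PCA object comes from prcomp() on
# standardized data; 'toy' is a random stand-in for the human data.
set.seed(1)
toy <- scale(matrix(rnorm(100 * 7), ncol = 7))
pca <- prcomp(toy)
# Proportion of variance explained: squared sdev over total variance
pve <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)
pve
```

These are the same percentages that `summary(pca)` reports on its "Proportion of Variance" row.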
data(tea)
# Lots of variables in the data, select only a few for clarity
keep_columns <- c("Tea", "price", "age_Q", "frequency", "sex", "SPC")
# select the 'keep_columns' to create a new dataset
tea <- dplyr::select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea)
## Tea price age_Q frequency sex
## black : 74 p_branded : 95 15-24:92 1/day : 95 F:178
## Earl Grey:193 p_cheap : 7 25-34:69 1 to 2/week: 44 M:122
## green : 33 p_private label: 21 35-44:40 +2/day :127
## p_unknown : 12 45-59:61 3 to 6/week: 34
## p_upscale : 53 +60 :38
## p_variable :112
##
## SPC
## employee :59
## middle :40
## non-worker :64
## other worker:20
## senior :35
## student :70
## workman :12
# visualize the dataset
gather(tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
The tea dataset contains a large number of variables; here we use only a few to keep the output of our tools readable. I selected the six variables above for this exercise. Overall the data seems fairly evenly distributed, and none of the variables has a single dominant value.
# multiple correspondence analysis
mca <- MCA(tea, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.340 0.304 0.227 0.220 0.199 0.196 0.191
## % of var. 9.712 8.684 6.481 6.294 5.680 5.602 5.469
## Cumulative % of var. 9.712 18.395 24.876 31.170 36.850 42.452 47.921
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.177 0.173 0.171 0.169 0.162 0.149 0.142
## % of var. 5.057 4.948 4.892 4.836 4.633 4.250 4.065
## Cumulative % of var. 52.977 57.926 62.818 67.654 72.287 76.537 80.602
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.134 0.126 0.115 0.113 0.091 0.054 0.045
## % of var. 3.841 3.587 3.292 3.235 2.605 1.551 1.286
## Cumulative % of var. 84.443 88.030 91.322 94.557 97.163 98.714 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.537 0.283 0.040 | 1.087 1.296 0.162 | 0.538 0.425
## 2 | 0.414 0.168 0.057 | 0.312 0.107 0.032 | 0.197 0.057
## 3 | 0.030 0.001 0.000 | 0.194 0.041 0.010 | 0.197 0.057
## 4 | -0.766 0.576 0.309 | 0.063 0.004 0.002 | 0.105 0.016
## 5 | 0.024 0.001 0.000 | 0.281 0.087 0.036 | -0.099 0.014
## 6 | -0.876 0.752 0.200 | 0.101 0.011 0.003 | 0.113 0.019
## 7 | 0.305 0.091 0.022 | 0.707 0.548 0.117 | 0.921 1.248
## 8 | 0.349 0.119 0.030 | 0.310 0.105 0.024 | 1.004 1.482
## 9 | 0.571 0.319 0.088 | 0.581 0.371 0.092 | 0.766 0.862
## 10 | 0.919 0.828 0.206 | 0.502 0.277 0.062 | 0.978 1.405
## cos2
## 1 0.040 |
## 2 0.013 |
## 3 0.010 |
## 4 0.006 |
## 5 0.004 |
## 6 0.003 |
## 7 0.199 |
## 8 0.250 |
## 9 0.159 |
## 10 0.233 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.796 7.655 0.207 7.871 | -0.202 0.550 0.013
## Earl Grey | -0.423 5.653 0.323 -9.831 | 0.060 0.126 0.006
## green | 0.692 2.583 0.059 4.206 | 0.103 0.064 0.001
## p_branded | -0.232 0.837 0.025 -2.733 | -0.151 0.395 0.011
## p_cheap | 0.200 0.046 0.001 0.534 | -0.929 1.105 0.021
## p_private label | -0.497 0.847 0.019 -2.357 | 0.156 0.094 0.002
## p_unknown | -0.280 0.154 0.003 -0.989 | 1.393 4.256 0.081
## p_upscale | 0.891 6.879 0.170 7.138 | -0.048 0.022 0.000
## p_variable | -0.114 0.238 0.008 -1.522 | 0.030 0.018 0.001
## 15-24 | -1.171 20.637 0.607 -13.472 | -0.447 3.356 0.088
## v.test Dim.3 ctr cos2 v.test
## black -1.996 | 0.581 6.116 0.111 5.748 |
## Earl Grey 1.386 | -0.024 0.028 0.001 -0.568 |
## green 0.628 | -1.160 10.866 0.166 -7.049 |
## p_branded -1.776 | -0.837 16.310 0.325 -9.856 |
## p_cheap -2.483 | 0.159 0.044 0.001 0.426 |
## p_private label 0.742 | 0.399 0.818 0.012 1.892 |
## p_unknown 4.917 | 0.434 0.554 0.008 1.533 |
## p_upscale -0.381 | 0.430 2.403 0.040 3.447 |
## p_variable 0.401 | 0.375 3.864 0.084 5.009 |
## 15-24 -5.137 | 0.195 0.857 0.017 2.243 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.324 0.014 0.232 |
## price | 0.184 0.107 0.327 |
## age_Q | 0.758 0.651 0.193 |
## frequency | 0.013 0.139 0.286 |
## sex | 0.072 0.206 0.001 |
## SPC | 0.689 0.707 0.322 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "ggplot")
My selected variables seem to be fairly evenly spread across the MCA dimensions. A few intuitively related groups stand out, such as age 60+ and non-worker; perhaps surprisingly, a cheap price is also close to these. Students and the youngest age group also lie close to each other.
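As a sanity check on the eigenvalue table above: in MCA the total inertia is (K − J)/J, where K is the total number of category levels and J the number of variables, and the number of non-trivial dimensions is K − J. A small base-R sketch using the level counts from `summary(tea)` above:

```r
# Level counts from summary(tea): Tea 3, price 6, age_Q 5,
# frequency 4, sex 2, SPC 7
K <- 3 + 6 + 5 + 4 + 2 + 7    # total category levels = 27
J <- 6                        # number of variables
K - J                         # 21 dimensions, as in the eigenvalue table
total_inertia <- (K - J) / J  # 3.5
# Dim.1's share of variance: eigenvalue / total inertia
round(100 * 0.340 / total_inertia, 2)  # ~9.71, close to the printed 9.712
```

The small discrepancy is only because the printed eigenvalue 0.340 is itself rounded.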
The final chapter of this course. This time we analyze longitudinal data. The wrangling part consisted of taking datasets in wide form (i.e. a single row per subject, with each time point's measurement in its own column) and converting them to long form, which has more rows (one per observation) but a single time variable. To achieve this, one must separate the values that do not change between observations (subject identifiers, group designators, etc.) from the time-dependent response variable.
The exercise uses two datasets. Both are similar in structure, with a subject identifier and a group variable as factors that do not change between observations, plus columns whose names encode time (weeks, WD) and contain the response values. Both were converted to long form in the data-wrangling part preceding this analysis.
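The wide-to-long conversion described above can be sketched with `tidyr::pivot_longer` on a toy data frame; the column names here are hypothetical stand-ins for the BPRS/RATS columns:

```r
library(tidyr)
# Toy wide-form data: one row per subject, one column per time point
wide <- data.frame(subject   = factor(1:3),
                   treatment = factor(c(1, 1, 2)),
                   week0 = c(40, 35, 50),
                   week1 = c(38, 33, 47))
# Gather the week* columns into a time variable and a response variable
long <- pivot_longer(wide, cols = starts_with("week"),
                     names_to = "week", values_to = "bprs")
# Extract the numeric time from the column name
long$week <- as.integer(sub("week", "", long$week))
```

The identifiers (`subject`, `treatment`) are repeated on every row, while `week` and `bprs` carry the time-dependent part.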
# Read datasets from file and initialize required libraries
library(tidyr)
library(dplyr)
library(vtable)
library(ggplot2)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
BPRSL <- read.csv("data/bprsl.csv")
RATSL <- read.csv("data/ratsl.csv")
# Factor the categories again, just in case
BPRSL$subject <- factor(BPRSL$subject)
BPRSL$treatment <- factor(BPRSL$treatment)
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
# and remove the X variables
BPRSL <- dplyr::select(BPRSL, -X)
RATSL <- dplyr::select(RATSL, -X)
ggplot(RATSL, aes(x = Time, y = rats, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$rats), max(RATSL$rats)))
# Number of Days
n <- RATSL$Time %>% unique() %>% length()
# Summary data with mean and standard error of RATS by Group and Time
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(rats), se = sd(rats) / sqrt(n) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(rats) +/- se(rats)")
RATSL8S <- RATSL %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise( mean=mean(rats) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(rats), WD 1-64")
# Compute analysis of variance to see if differences in groups are significant
aov <- aov(mean ~ Group, data = RATSL8S)
summary(aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 236732 118366 88.07 2.76e-08 ***
## Residuals 13 17471 1344
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Use pairwise comparisons to see which groups differ significantly.
TukeyHSD(aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mean ~ Group, data = RATSL8S)
##
## $Group
## diff lwr upr p adj
## 2-1 220.98864 161.71195 280.2653 0.0000006
## 3-1 262.07955 202.80286 321.3562 0.0000001
## 3-2 41.09091 -27.35592 109.5377 0.2864426
While the earlier graphs show some outliers, I decided to keep them, as the boxplots still suggest a statistically significant difference at least between Groups 1 and 2, as well as between Groups 1 and 3. I then used ANOVA to compare the group means, which indicates a significant difference between the groups. To pinpoint which groups differ, I used Tukey's multiple pairwise comparisons. The p-values for the comparisons of Groups 2 vs 1 and 3 vs 1 are significant, which nicely supports what the boxplots suggest. However, although the confidence intervals of Groups 2 and 3 mostly do not overlap in the earlier graphs, these two groups do not differ significantly in the pairwise comparisons.
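As an alternative pairwise check, a pairwise t-test with Bonferroni adjustment can be run on the group means. The sketch below uses toy data with roughly the same group structure as `RATSL8S` (the values are made up, not the real rat weights):

```r
# Toy stand-in for RATSL8S: three groups of subject-level means
set.seed(2)
toy <- data.frame(Group = factor(rep(1:3, each = 5)),
                  mean  = c(rnorm(5, 260, 20),
                            rnorm(5, 480, 20),
                            rnorm(5, 520, 20)))
# Pairwise t-tests with Bonferroni-adjusted p-values
res <- pairwise.t.test(toy$mean, toy$Group, p.adjust.method = "bonferroni")
res
```

Unlike `TukeyHSD`, this does not give confidence intervals for the differences, but the adjusted p-values answer the same question.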
Next we use the BPRS dataset to do a linear mixed effects model.
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
There does not seem to be a clear difference between the treatment groups. Let's try to create a regression model.
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Treatment group does not seem to be a significant explanatory variable, while week is highly significant.
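As a quick reading of the coefficient table, the fitted bprs for treatment group 1 (where the `treatment2` dummy is 0) is just the intercept plus the week effect times the week. For example, at week 4:

```r
# Coefficients as printed in the summary above
b <- c(intercept = 46.4539, week = -2.2704, treatment2 = 0.5722)
# Predicted bprs for treatment group 1 at week 4
pred_week4 <- unname(b["intercept"] + b["week"] * 4)
round(pred_week4, 2)  # 37.37
```

For treatment group 2 the prediction would simply shift up by the (non-significant) `treatment2` estimate of 0.5722.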
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
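The `anova()` comparison above is a likelihood-ratio test: the chi-square statistic is the difference of the two deviances, with degrees of freedom equal to the difference in parameter counts. Reproducing it from the printed values:

```r
# Deviances and parameter counts from the anova() table above
chisq <- 2738.7 - 2731.4            # 7.3 (printed as 7.2721 before rounding)
p <- pchisq(chisq, df = 7 - 5, lower.tail = FALSE)
round(p, 4)  # ~0.026, matching the printed Pr(>Chisq) of 0.02636
```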
We see that the model with a random slope for week fits slightly better than the random-intercept model (chi-square p = 0.02636). Next we create a model that adds a week-by-treatment interaction and compare it with the previous random-slope model.
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It would seem that this model is not significantly better than the model without the interaction. This matches our earlier impression from the graphs that the treatment does not have a significant effect on the bprs scores.
Finally, let's compute the fitted values from the model and draw graphs representing them.
Fitted <- fitted(BPRS_ref2)
BPRSL <- BPRSL %>% mutate(Fitted)
ggplot(BPRSL, aes(x = week, y = Fitted, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
Comparing these fitted graphs with the earlier graphs of the observed values, the model seems to fit quite nicely. However, there is a negligible difference between the groups. In fact, most of the difference appears at the intercept (week 0), which suggests that the randomization of participants into the two treatment groups may in fact have failed.